TRELLIS+: An Effective Approach for Indexing Genome-Scale Sequences Using Suffix Trees

Authors

Benjarath Pupacdi

Mohammed J. Zaki

Abstract

With advances in high-throughput sequencing methods, and the corresponding exponential growth in sequence data, it has become critical to develop scalable data management techniques for sequence storage, retrieval and analysis. In this paper we present a novel disk-based suffix tree approach, called TRELLIS+, that effectively scales to massive amounts of sequence data using only a limited amount of main memory, based on a novel string buffering strategy. We show experimentally that TRELLIS+ outperforms existing suffix tree approaches; it is able to index genome-scale sequences (e.g., the entire human genome), and it also allows rapid query processing over the disk-based index. Availability: TRELLIS+ source code is available online at http://www.cs.rpi.edu/-zaki/software/trellis

Similar resources

Indexing huge genome sequences for solving various problems.

Because of the increase in the size of genome sequence databases, the importance of indexing the sequences for fast queries grows. Suffix trees and suffix arrays are used for simple queries. However, these are not suitable for complicated queries over huge amounts of sequences because the indices are stored on disk, which has slow access speed. We propose storing the indices in memory in a compres...

RepMaestro: scalable repeat detection on disk-based genome sequences

MOTIVATION We investigate the problem of exact repeat detection on large genomic sequences. Most existing approaches based on suffix trees and suffix arrays (SAs) are limited either to small sequences or those that are memory resident. We introduce RepMaestro, a software that adapts existing in-memory-enhanced SA algorithms to enable them to scale efficiently to large sequences that are disk re...

Speeding up Index Construction with GPU for DNA Data Sequences

Advances in technology in the scientific community have produced terabytes of biological data, including DNA sequences. String matching algorithms, which are traditionally used to match DNA sequences, now take much longer to execute because of the large size of DNA data and the small size of the alphabet. To overcome this problem, the indexing methods such as suffix arrays or ...

Constructing Genome Scale Suffix Trees

Suffix trees have been the focus of significant research interest as they permit very efficient solutions to a range of string and sequence searching problems. Given a suffix tree that encodes a particular string, it is possible to solve problems such as searching for a specific pattern in time proportional to the length of the pattern rather than the length of the string. Suffix trees can also...

Suffix trees for inputs larger than main memory

A suffix tree is a fundamental data structure for string searching algorithms. Unfortunately, when it comes to the use of suffix trees in real-life applications, the current methods for constructing suffix trees do not scale for large inputs. As suffix trees are larger than the input sequences and quickly outgrow the main memory, the first attempts at building large suffix trees focused on algo...

Journal:

Pacific Symposium on Biocomputing

Publication date: 2008
